Add two regression datasets: California Housing and Diabetes #39

Shu-Wan · 2026-02-06T01:31:37Z

Summary

This PR adds two classic regression datasets from sklearn to CausalBench for demo and testing purposes.

Datasets Added

1. California Housing Dataset

Samples: 20,640
Columns: 9 (8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude; plus the target MedHouseVal)
Task: Regression - predict median house values in California districts
Source: sklearn.datasets.fetch_california_housing

2. Diabetes Dataset

Samples: 442
Columns: 11 (10 features: age, sex, bmi, bp, s1-s6; plus target)
Task: Regression - predict disease progression from physiological variables
Source: sklearn.datasets.load_diabetes

Changes

Dataset Files

Each dataset includes:

CSV data file with all features and target
config.yaml configuration file following CausalBench schema
download_data.py script to regenerate data from sklearn
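As an illustration of the regeneration step, here is a minimal sketch of what such a script could look like for the Diabetes dataset. It is not the script from this PR: the function name `regenerate_diabetes_csv` is hypothetical, and the sketch assumes scikit-learn and pandas are installed. It relies only on `load_diabetes(as_frame=True)`, whose `.frame` attribute holds the ten feature columns followed by `target`.

```python
from pathlib import Path


def regenerate_diabetes_csv(out_path):
    """Write the sklearn Diabetes dataset to a CSV file and return (rows, columns)."""
    # Imported lazily so this module can be imported without scikit-learn installed.
    from sklearn.datasets import load_diabetes

    # as_frame=True yields a pandas DataFrame; .frame contains the feature
    # columns followed by the 'target' column.
    frame = load_diabetes(as_frame=True).frame

    # index=False keeps the CSV limited to the dataset columns themselves.
    frame.to_csv(Path(out_path), index=False)
    return frame.shape
```

Calling `regenerate_diabetes_csv("diabetes_data.csv")` from the dataset directory would recreate the CSV. A California Housing variant would presumably be analogous with `fetch_california_housing(as_frame=True)`, which downloads the data on first use.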

Deliverables

✅ Two regression datasets in causalbench-asu/tests/data/
✅ Compressed .zip files for each dataset
✅ Updated README.md with dataset information

Design Decisions

No causal adjacency matrices: These are only required for causal discovery tasks, not regression tasks
Classic sklearn datasets: Well-defined, documented, appropriate size for demos
Standalone regression tasks: Configured explicitly as regression tasks in descriptions

Testing

All datasets successfully load through the CausalBench framework:

✅ California Housing Dataset: PASSED
✅ Diabetes Dataset: PASSED

Total: 2/2 tests passed

kapkic · 2026-02-06T15:36:05Z

LGTM

Copilot

Pull request overview

This PR adds two sklearn-sourced regression datasets (California Housing and Diabetes) to the CausalBench test data bundle for demos/testing, along with lightweight download scripts and README documentation updates.

Changes:

Added california_housing dataset config + regeneration script (and accompanying data/zip artifacts).
Added diabetes dataset config + CSV + regeneration script (and accompanying zip artifact).
Updated README dataset table; minor formatting cleanup in zip_files.py.

Reviewed changes

Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.

File	Description
causalbench-asu/tests/zip_files.py	Minor formatting / quoting updates for zip utility.
causalbench-asu/tests/data/diabetes/download_data.py	Script to regenerate the Diabetes CSV from sklearn.
causalbench-asu/tests/data/diabetes/diabetes_data.csv	Added Diabetes dataset CSV.
causalbench-asu/tests/data/diabetes/config.yaml	Added dataset config for Diabetes.
causalbench-asu/tests/data/diabetes.zip	Added packaged dataset zip.
causalbench-asu/tests/data/california_housing/download_data.py	Script to regenerate the California Housing CSV from sklearn.
causalbench-asu/tests/data/california_housing/config.yaml	Added dataset config for California Housing.
README.md	Updated dataset list table to include the new datasets + minor whitespace cleanup.

Copilot · 2026-02-06T18:20:20Z

causalbench-asu/tests/data/diabetes/config.yaml

+                type: continuous
+                data: decimal
+            sex:
+                header: sex
+                type: continuous
+                data: decimal
+            bmi:
+                header: bmi
+                type: continuous
+                data: decimal
+            bp:
+                header: bp
+                type: continuous
+                data: decimal
+            s1:
+                header: s1
+                type: continuous
+                data: decimal
+            s2:
+                header: s2
+                type: continuous
+                data: decimal
+            s3:
+                header: s3
+                type: continuous
+                data: decimal
+            s4:
+                header: s4
+                type: continuous
+                data: decimal
+            s5:
+                header: s5
+                type: continuous
+                data: decimal
+            s6:
+                header: s6
+                type: continuous
+                data: decimal
+            target:
+                header: target
+                type: continuous


The column type values are set to continuous, but other dataset configs in this repo use values like ratio/nominal (e.g., tests/data/panama/config.yaml). If the dataset loader only recognizes the existing enum values, continuous will fail schema validation or parsing. Please align these column type fields with the accepted values used elsewhere (e.g., use ratio for numeric continuous variables) or update the loader/schema to explicitly support continuous.

Suggested change

-                type: continuous
-                data: decimal
-            sex:
-                header: sex
-                type: continuous
-                data: decimal
-            bmi:
-                header: bmi
-                type: continuous
-                data: decimal
-            bp:
-                header: bp
-                type: continuous
-                data: decimal
-            s1:
-                header: s1
-                type: continuous
-                data: decimal
-            s2:
-                header: s2
-                type: continuous
-                data: decimal
-            s3:
-                header: s3
-                type: continuous
-                data: decimal
-            s4:
-                header: s4
-                type: continuous
-                data: decimal
-            s5:
-                header: s5
-                type: continuous
-                data: decimal
-            s6:
-                header: s6
-                type: continuous
-                data: decimal
-            target:
-                header: target
-                type: continuous
+                type: ratio
+                data: decimal
+            sex:
+                header: sex
+                type: ratio
+                data: decimal
+            bmi:
+                header: bmi
+                type: ratio
+                data: decimal
+            bp:
+                header: bp
+                type: ratio
+                data: decimal
+            s1:
+                header: s1
+                type: ratio
+                data: decimal
+            s2:
+                header: s2
+                type: ratio
+                data: decimal
+            s3:
+                header: s3
+                type: ratio
+                data: decimal
+            s4:
+                header: s4
+                type: ratio
+                data: decimal
+            s5:
+                header: s5
+                type: ratio
+                data: decimal
+            s6:
+                header: s6
+                type: ratio
+                data: decimal
+            target:
+                header: target
+                type: ratio

kapkic · 2026-02-08

Actually, this is important. @prat-man @Shu-Wan I don't recall that we have any "type" for datasets. We may need to remove them. Please confirm.

Shu-Wan · 2026-02-10

The type field is not used at the moment; see:

CausalBench/causalbench-asu/causalbench/modules/dataset.py

Lines 76 to 112 in a95d5cc

    for column, col_data in data.columns.items():
        if data.headers:
            col_df = file_df[col_data.header]
        else:
            col_df = file_df[col_data._index]

        if col_data.data == 'integer':
            if not pd.api.types.is_integer_dtype(col_df):
                raise TypeError(f'Data type mismatch for column {column}')
            if 'labels' in col_data:
                labels = sorted(col_data.labels)
                data_labels = sorted(file_df[col_data.header].unique())
                if labels != data_labels:
                    raise ValueError(f'Labels do not match for column {column}')
            if 'range' in col_data:
                start = col_data.range.start
                end = col_data.range.end
                min1 = min(file_df[col_data.header])
                max1 = max(file_df[col_data.header])
                if not (start <= min1 <= end and start <= max1 <= end):
                    raise ValueError(f'Range does not match for column {column}')

        elif col_data.data == 'decimal':
            if not pd.api.types.is_float_dtype(col_df):
                raise TypeError(f'Data type mismatch for column {column}')
            if 'labels' in col_data:
                labels = sorted(col_data.labels)
                data_labels = sorted(file_df[col_data.header].unique())
                if labels != data_labels:
                    raise ValueError(f'Labels do not match for column {column}')
            if 'range' in col_data:
                start = col_data.range.start
                end = col_data.range.end
                min1 = min(file_df[col_data.header])
                max1 = max(file_df[col_data.header])
                if not (start <= min1 <= end and start <= max1 <= end):
                    raise ValueError(f'Range does not match for column {column}')
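To make the point concrete, here is a simplified, standard-library-only sketch of the validation loop quoted above. It is not the CausalBench implementation: `validate_column` and the plain-dict column spec are hypothetical stand-ins, and Python lists replace the pandas columns. As in the quoted code, only `data`, `labels`, and `range` affect the outcome; `type` is never read.

```python
def validate_column(name, spec, values):
    """Validate one column against a config-like spec (hypothetical helper).

    Mirrors the structure of the quoted loop: only the 'data', 'labels' and
    'range' keys are inspected. The 'type' key is deliberately never read.
    """
    expected = {"integer": int, "decimal": float}.get(spec.get("data"))
    if expected is None:
        return  # other data kinds are not checked in this sketch

    # Exact type comparison, so a bool does not pass as an integer.
    if not all(type(v) is expected for v in values):
        raise TypeError(f"Data type mismatch for column {name}")

    if "labels" in spec and sorted(spec["labels"]) != sorted(set(values)):
        raise ValueError(f"Labels do not match for column {name}")

    if "range" in spec:
        start, end = spec["range"]["start"], spec["range"]["end"]
        if not (start <= min(values) and max(values) <= end):
            raise ValueError(f"Range does not match for column {name}")


bmi = [0.0617, -0.0515, 0.0445]

# The same data passes regardless of what 'type' says, or whether it is set.
for type_value in ("continuous", "ratio", None):
    spec = {"header": "bmi", "type": type_value, "data": "decimal"}
    validate_column("bmi", spec, bmi)
```

Under these assumptions, changing `type` from `continuous` to `ratio` or to blank makes no difference to whether a dataset loads.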

Existing configs either leave it blank or set it to ratio or nominal.

Based on the schema, it can be defined quite arbitrarily (any string or null):

CausalBench/causalbench-asu/causalbench/modules/schema/dataset.yaml

Lines 67 to 70 in a95d5cc

type:
  anyOf:
    - type: string
    - type: 'null'

I changed it to blank.

@kapkic @prat-man
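The schema fragment quoted above accepts any string or null for the type field. A minimal sketch of that rule in plain Python, assuming nothing beyond the four quoted schema lines; `type_field_is_valid` is a hypothetical name and not part of CausalBench:

```python
def type_field_is_valid(value):
    """Return True if value satisfies anyOf[string, null] (hypothetical helper)."""
    return value is None or isinstance(value, str)


# 'continuous', 'ratio', 'nominal', and a blank (null) entry all satisfy the rule.
for candidate in ("continuous", "ratio", "nominal", None):
    assert type_field_is_valid(candidate)

# A non-string scalar would be rejected.
assert not type_field_is_valid(3.14)
```

A key left blank in YAML parses as null, which is why both the original `continuous` value and the blank value pass this rule.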

causalbench-asu/tests/data/california_housing/config.yaml

README.md

update zip_files update docs

add two regression datasets

0c91ed3

Shu-Wan assigned prat-man and kapkic Feb 6, 2026

Shu-Wan marked this pull request as ready for review February 6, 2026 18:15

Copilot AI review requested due to automatic review settings February 6, 2026 18:15

Copilot started reviewing on behalf of Shu-Wan February 6, 2026 18:15 View session

Copilot AI reviewed Feb 6, 2026

update configs

2c5a560

update zip_files update docs

kapkic Feb 8, 2026

Shu-Wan Feb 10, 2026
